Abstract

This paper assumes prior detections of multiple targets at each time instant, and uses a
graph-based approach to connect those detections across time, based on their position and
appearance estimates. In contrast to most earlier works in the field, our framework has been
designed to exploit the appearance features, even when they are only sporadically available, or
affected by non-stationary noise, along the sequence of detections. This is done by implementing
an iterative hypothesis testing strategy to progressively aggregate the detections into short
trajectories, named tracklets. Specifically, each iteration considers a node, named key-node, and
investigates how to link this key-node with other nodes in its neighborhood, under the assumption
that the target appearance is defined by the key-node appearance estimate. This is done through
shortest-path computation in a temporal neighborhood of the key-node. The approach is conservative
in that it only aggregates the shortest paths that are sufficiently better than alternative paths.
It is also multi-scale in that the size of the investigated neighborhood is increased
proportionally to the number of detections already aggregated into the key-node. The multi-scale
nature of the process and the progressive relaxation of its conservativeness make it both
computationally efficient and effective.

Experimental validations are performed extensively on a toy example, a 15-minute
multi-view basketball dataset, and other monocular pedestrian datasets.

Keywords: multi-object tracking, graph-based formalism, hypothesis testing, unreliable features,
sporadic
1. Introduction

Multi-object tracking (MOT) is a fundamental issue in computer vision. It supports high-level
semantic scene analysis in numerous and various applications. Vehicle trajectories are, for
example, collected to control traffic monitoring solutions [1]. People displacement analysis is
important to improve the security of public spaces [2], or to understand sport actions [3]. In
microscopy, tracking of cells helps to understand biological processes [4].
1.1. Detection-based MOT problem formulation

Due to recent improvements in object detection, many detection-based approaches have been 
proposed to handle the MOT problem. In such approaches, plausible object locations are first 
estimated in each individual frame and some features, characterizing the appearances of the 
detected objects, are extracted. Afterwards, the MOT problem is formulated as the problem 
of grouping these detections into a minimum number of disjoint trajectories, each trajectory 
corresponding to a single physical entity. This data association problem is usually handled by 
graph-based solutions. First, a graph is defined to connect a set of nodes that correspond to
the detections (or unambiguous associations of detections, named tracklets). Each edge gets a
weight that reflects either distance (or dissimilarity) or similarity in terms of spatio-temporal
displacement and/or appearance between the two nodes it connects. Afterwards, multi-object
tracking can be formulated in its general form as the problem of partitioning the graph into
disjoint sets $T_i$, $i > 0$, of nodes such that

• each set contains one and only one detection at each time instant^1,

• each detection is included in one and only one of the sets^2,

• the elements of a set are consistent in terms of appearance and spatio-temporal features.

Formally, this can be written as 

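\[
\begin{aligned}
\text{minimize} \quad & \sum_{i > 0} C(T_i), \\
\text{subject to} \quad & T_i \cap T_j = \emptyset, \quad \forall i \neq j, \\
& \forall i > 0 \text{ and } \forall u, v \in T_i,\ u \neq v:\ t_u \neq t_v, \\
& \forall i > 0 \text{ and } \forall t,\ \exists u \in T_i \text{ with } t_u = t,
\end{aligned}
\tag{1}
\]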

where $\mathcal{V}$ represents the set of all nodes, $C(T_i)$ represents the dissimilarity cost
within the $i$-th set $T_i$, and $t_u$ represents the associated time of node $u$. In Equation (1),
the first two constraints require that the sets $\{T_i\}_{i>0}$ define a valid partition, whereas
the last constraint requires that $T_i$ cannot have multiple detections from the same time instant.
The cost $C(T_i)$ should be defined such that it decreases (increases) when the detections in $T_i$
have a small (large) dissimilarity between them, reflecting (in)consistent associations. The
quality of the solution relies on the definition of $C(T_i)$. Ideally, if there are $n$ nodes within
$T_i$, the dissimilarity function should consider all $n(n-1)/2$ time-causal pairs of nodes, and
associate to each of them a cost that increases with the likelihood that the nodes in a pair
correspond to two distinct targets. That is,

\[
C(T_i) := \sum_{\substack{u, v \in T_i \\ u \neq v}} w_{uv}, \tag{2}
\]


^1 Potential missed detections (MD) or appearing/vanishing targets are typically handled based on
virtual nodes. The inclusion of such a virtual node in a set $T_i$ induces a penalty to avoid
selecting the virtual node option if the frame includes a proper detection.

^2 False positive detections (FP) are typically handled by defining a false detection set $T_{FP}$
that gathers all detections that are not part of a trajectory set $T_i$. A false positive penalty
is assigned to each element in $T_{FP}$ to avoid the inclusion of correct detections in this set.
For simplification purposes, we ignore MD and FP in the rest of this introduction section.
where $w_{uv}$ is defined to decrease with the likelihood that the nodes $u$ and $v$ correspond to
the same physical object in terms of space, time and/or appearance. Typically, $w_{uv}$ should
increase with the appearance dissimilarity and the spatial distance between nodes $u$ and $v$. More
importantly, its definition should also account

1. for the time elapsed between $u$ and $v$: a larger time interval makes it more likely that
a target has moved or changed in appearance, hence reducing $w_{uv}$ for a given observed
dissimilarity;

2. for the confidence we have in the observations; an unreliable feature should not lead to 
definitive conclusion about whether the nodes correspond to the same target or not. 

In the following, we refer to these two principles as time evanescence and reliable feature 
prominence respectively. 

1.2. Previous art simplification and related issues 

Given the definition of $C(T_i)$ provided in Equation (2), solving Equation (1) rapidly becomes
computationally intractable. As pointed out by Zamir et al. [5], the problem becomes equivalent
to the travelling salesman problem, which is known to be NP-complete. Therefore, most previous
works build on the time evanescence principle described above, namely on the fact that $w_{uv}$
should only be large for nodes that are close in time, to simplify the problem. Specifically, they
ignore dissimilarities between far away nodes and only consider for each node $u$ the cost
$w_{uv^*}$ induced by its immediately subsequent node $v^*$ in $T_i$. Formally,

\[
C(T_i) := \sum_{u, v^* \in T_i} w_{uv^*}, \tag{3}
\]

where $v^* = \operatorname{argmin}_{v \in T_i,\, t_v > t_u} (t_v - t_u)$ is the node in $T_i$ that
follows $u$ and is temporally the closest to it.
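The difference between the full pairwise cost of Equation (2) and the consecutive-node simplification of Equation (3) can be illustrated with a minimal Python sketch. The node representation (a dict with a time `t` and an optional `colour`) and the `toy_weight` function are hypothetical choices made only for this illustration; in particular, `toy_weight` ignores the time evanescence principle and uses a finite value of 10 in place of a prohibitive cost.

```python
from itertools import combinations


def toy_weight(u, v):
    """Hypothetical pairwise cost: 0 for matching colours, 10 for distinct
    colours, and 1 when at least one colour is unobserved."""
    cu, cv = u["colour"], v["colour"]
    if cu is None or cv is None:
        return 1
    return 0 if cu == cv else 10


def pairwise_cost(track, w):
    """Cost in the spirit of Equation (2): every pair of nodes contributes."""
    ordered = sorted(track, key=lambda n: n["t"])
    return sum(w(u, v) for u, v in combinations(ordered, 2))


def consecutive_cost(track, w):
    """Cost in the spirit of Equation (3): only temporally adjacent nodes contribute."""
    ordered = sorted(track, key=lambda n: n["t"])
    return sum(w(u, v) for u, v in zip(ordered, ordered[1:]))


# A track that starts red, passes through an unobserved node, and ends green.
track = [
    {"t": 0, "colour": "red"},
    {"t": 1, "colour": None},
    {"t": 2, "colour": "green"},
]
print(consecutive_cost(track, toy_weight))  # 2: the colour conflict goes unnoticed
print(pairwise_cost(track, toy_weight))     # 12: the red/green pair is penalized
```

The consecutive cost only sees the two links through the unobserved node, whereas the pairwise cost also accounts for the distant red/green pair.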

Doing so, Equation (1) becomes easy to solve, since it basically reduces to finding a set of
paths with a minimal cumulative cost. This can be solved by using a greedy shortest-paths
computation [6], or by running the K-shortest paths (KSP) algorithm [7]. Apart from KSP, several
other algorithms, such as the network flow algorithm [8] or robust hierarchical association [9],
can be envisioned to estimate the K tracks under this simplification assumption. These approaches
have proven effective in a variety of scenarios, since the prominence of the links connecting
close observations holds in many practical association problems.

This simplification, however, fails to correctly model the tracking problem when the cost
$w_{uv}$ of the links that connect nodes that are distant in time becomes important compared to the
links between subsequent observations. This typically happens when discriminant features are
observed with a variable level of reliability over time. In this case, due to the reliable feature
prominence principle, the cost $w_{uv}$ becomes relatively more significant (either smaller or
larger depending on whether the nodes are similar or not) between far away, but reliably observed,
nodes than between close nodes with noisy features. Such cases are prevalent in numerous practical
scenarios. For example, color histograms appear to be quite noisy in the presence of occlusions,
and in some other cases, highly discriminant appearance features are only available sporadically
(and under certain configurations only). For example, in sports, a number on a jersey is visible
only when facing the camera.

In such time-varying observation processes, the task of tracking multiple objects, while taking
into account the position and all the available appearance features, cannot be addressed properly
with the formulation in Equation (3). This is due to the fact that the consistency of a track
cannot be measured by the mere accumulation of (dis)similarities between the consecutive nodes
in the track, simply because the appearance features might be unreliable or even purely
unavailable in some nodes. This major shortcoming of conventional graph-based tracking is
illustrated in Figure 1.

Figure 1: Problem of conventional tracking methods in the presence of sporadic appearance features.
(a) Detections and trajectories corresponding to two targets (red and green) for 5 consecutive
frames are shown. Gray nodes do not have appearance features. (b) For readability, only a subset of
the edges of the fully connected graph considered in Equation (2) is depicted (see text for
details). (c) Conventional tracking algorithms, i.e., with the time-evanescence simplification
assumption, fail to track the target correctly, and result in appearance inconsistencies along a
track. (d) Given the appearance of the key-node a, it is possible to simply increase (respectively,
decrease) the cost of going through the nodes that are dissimilar (respectively, similar) in the
graph, irrespective of whether the nodes are temporally close or far. The resulting shortest-path
from a, shown by the thick blue arrow, is consistent with all the available appearances. Best
viewed in color.

Figure 1(a) depicts the ground-truth trajectories of a red and a green target, as well as the
appearance observed in each time frame for each of the targets. The color of a node indicates
whether the color of the target is available (red or green) or unavailable/unreliable (gray). The
problem, defined by Equations (1) and (2), is depicted in Figure 1(b). Edge cost is zero when
connecting two nodes with the same color, intermediate (and function of spatio-temporal
measurements) when the color information is lacking for one of the nodes, and infinite when the
connected nodes have distinct colors. For readability, only the edges connecting the detections
that are observed at consecutive times are depicted (in black), plus the edges with infinite weight
(in red). Other $w_{uv}$ are negligible due to the fact that $w_{uv}$ has to decrease as time
elapses between $u$ and $v$ (time evanescence principle discussed above). The solution to problem
(1), computed from this graph using an exhaustive search approach, corresponds to the desired
tracks and is depicted in Figure 1(a). In contrast, making the simplification assumption presented
in Equation (3), and thus omitting all links between non-consecutive nodes, fails to track the
target correctly. This is depicted in Figure 1(c), where we observe that a conventional
(K-)shortest-paths approach ends up associating red and green nodes.

Interestingly, we also observe from this toy example that, if we were specifically interested
in tracking the green target observed in the node a depicted on the top left of Figure 1(d), a
trivial solution would consist in increasing/decreasing the cost of an edge when it enters a
red/green node, wherever they occur along the track. In that way, the shortest-path to connect node
a to the window extremity would become consistent with the color observations.
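The single-target idea described above can be sketched on a small hypothetical example. The detections, positions and the penalty value below are invented for illustration (they are not the data of Figure 1), and the sketch only penalizes contradicting nodes rather than also rewarding matching ones. An exhaustive search over one detection per later frame stands in for a shortest-path computation.

```python
from itertools import product

# Hypothetical toy detections: name -> (frame, 1-D position, colour or None if unobserved).
DETECTIONS = {
    "a": (0, 0, "green"), "b": (0, 10, "red"),
    "c": (1, 2, None),    "d": (1, 12, None),
    "e": (2, 4, "red"),   "f": (2, 10, "green"),
}


def path_cost(path, key_colour=None, penalty=100):
    """Sum of displacements, plus a penalty for each entered node whose
    observed colour contradicts key_colour (when a key colour is given)."""
    cost = 0
    for u, v in zip(path, path[1:]):
        cost += abs(DETECTIONS[v][1] - DETECTIONS[u][1])
        colour = DETECTIONS[v][2]
        if key_colour is not None and colour is not None and colour != key_colour:
            cost += penalty
    return cost


def best_path(key, key_colour=None):
    """Exhaustively pick one detection per later frame, starting from key."""
    start_frame = DETECTIONS[key][0]
    later = sorted({f for f, _, _ in DETECTIONS.values() if f > start_frame})
    per_frame = [[n for n, d in DETECTIONS.items() if d[0] == f] for f in later]
    candidates = [(key,) + choice for choice in product(*per_frame)]
    return min(candidates, key=lambda p: path_cost(p, key_colour))


print(best_path("a"))           # ('a', 'c', 'e'): displacement only, ends on a red node
print(best_path("a", "green"))  # ('a', 'c', 'f'): consistent with the key-node colour
```

Without the appearance hypothesis, the cheapest path drifts to the red detection because it is spatially closer; with the hypothesis attached to node a, the penalty on red nodes redirects the path to the green detection even though an unobserved node lies in between.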
In this paper, as a primary contribution, we propose to extend this trivial single-target
tracking solution to a multi-object tracking context, in which no prior knowledge is available
about the actual appearance of the targets, and in which the appearance measurements are subject to
noise that changes over time (non-stationary), but whose relative importance is known as a prior.
In practice, this prior is typically derived from the detector (which might reveal an occlusion
that hampers the appearance observation) or from the feature measurement process (e.g., a digit
recognition algorithm might conclude that no digit is visible or that its recognition is quite
uncertain).


1.3. Contribution

To circumvent the limitations of conventional algorithms, we propose a new paradigm to aggregate
detections into object trajectories. It extends the trivial solution depicted in Figure 1(d)
by embedding shortest-path computations within an iterative hypothesis testing (IHT) strategy.

Each iteration of the algorithm works as follows. A node, named key-node (node a in
Figure 1(d)), is selected to define a target appearance hypothesis. Given this hypothesis, a
shortest-path algorithm is considered to investigate how to aggregate the key-node with its
temporal neighbors in the graph, while promoting the nodes that share its appearance, just as for
node a in Figure 1(d). The process is repeated iteratively, each node possibly becoming a key-node
at some step of the algorithm. To avoid misleading the overall multi-object tracking process
due to a wrong intermediate aggregation decision, e.g., caused by some inappropriate appearance
hypothesis, the shortest-path connecting the key-node to its neighborhood is only validated
when it is ‘sufficiently shorter’ than alternative paths. The criterion to validate the
shortest-path is very strict at the beginning of the iterative process but is then progressively
relaxed as the iterations proceed. This progressive relaxation makes the process greedy in the
sense that the most reliable tracklets will be extracted first, independently of the order in which
nodes are scheduled as key-nodes.

Another worthwhile design choice consists in adapting the observation window to the size
of the key-node (i.e., the number of detections already aggregated into the key-node), making the
process multi-scale. The advantages are two-fold. First, it reduces complexity by aggregating the
nodes locally before considering larger observation windows. Second, it gives the opportunity to
investigate long time horizons based on more reliable appearance information (since appearance
has been accumulated on many frames for large key-nodes), which benefits the tracking accuracy.

The proposed approach has the advantage of naturally accounting for different levels of
reliability in the observation process, typically by giving more credit to the reliable appearance
measurements when defining the cost associated to the discrepancy between the target appearance
hypothesis and a node appearance estimate. Hence, the algorithm becomes able to effectively
exploit sporadic features or features whose reliability varies over time, which is a significant
step forward compared to the state-of-the-art.

Compared to our initial work presented in [10], this journal paper:

• positions our algorithm with respect to the generic MOT formulation defined by Equations
(1) and (2), revealing that a larger range of practical problems can be addressed by our
solution, compared to the ones supported by the conventional simplification adopted in
Equation (3),

• introduces the progressive relaxation concept, which is a valuable extension compared to
our conference paper since it avoids the tedious and hazardous tuning of the thresholds
associated to the validation criteria (see the testing phase in Section 4.2.2),

• extends the off-line algorithm presented in [10] to on-line tracking scenarios (see Section 5),

• releases a public reference software implementation of our algorithm^3,

• provides extensive validations both on synthetic and real-life data, which helps in assessing
the practical usability and relevance of our approach.

The rest of the paper is organized as follows. Section 2 presents the (few) methods that have
been previously introduced to handle sporadic features or features with time-varying reliability.
Section 3 defines the graph terminology. Our iterative hypothesis testing algorithm is described
and discussed in Section 4. Section 5 extends our approach to on-the-fly incremental tracking
scenarios. Section 6 presents the experimental results, and demonstrates the efficiency and
effectiveness of our approach on both synthetic and real-life datasets.


2. Related works 


In this section, we review the few works that have been proposed to address multi-object
tracking in the presence of features that are sporadic and/or affected by non-stationary noise.

The global appearance constraint (GAC) approach [11] assumes a prior knowledge of a discrete
set of N possible appearances, and ends up computing K-shortest paths in an N-layered
graph, K being the number of targets, and N corresponding to the number of possible target
appearances. In contrast, to avoid the computational burden associated to the construction of an
N-layer graph, and to handle cases for which the possible set of appearances is not a known and
finite discrete set, we embed the hypothesis testing within an iterative local aggregation
framework. We show in our validation that this results in significant accuracy improvements.

The discriminative label propagation (DLP) approach [12] presents an elegant method to
combine various appearance and spatio-temporal relationships between the detections by
constructing a number of complementary graphs, and assigning labels to these detections in a
manner that is consistent with all graphs. This approach only handles sporadic appearance features,
and thus cannot take advantage of continuous feature reliability priors.

Zamir et al. [5] adopt a formulation similar to the one defined in Equation (2), but apply it
on short (typically 50 frames long) segments, for computational tractability. The solution on
each segment is obtained by solving a generalized minimum clique problem (GMCP) based
on a greedy heuristic. The same procedure is repeated in a hierarchical manner to generate long
trajectories. In addition to the sub-optimality of the GMCP solution, a drawback of this approach
lies in the fact that decisions have to be taken at each level of the hierarchy before moving to
the next level. As a consequence, lower level decisions (derived from only partial information)
might be wrong and impact the final solution. In contrast, our approach works conservatively
and does not force decisions on small observation windows when those decisions are ambiguous
(see the testing phase in Section 4.2.2).


^3 http://sites.uclouvain.be/ispgroup/index.php/Softwares/HomePage

3. Graph formalism and notations

As an input, the algorithm receives the set of candidate targets, detected independently at
each time instant as described in [13]. Apart from the detection time $t$ and the location $x$, the
detector computes $N$ appearance features $f_i$ ($1 \le i \le N$) for a target. Since a feature
might be unreliable or even missing, the detector outputs a confidence value $c_i \in [0, 1]$ for
each feature ($c_i = 0$ standing for a missing feature). A detection $d$ is therefore characterized
by the vector

\[
d = (t, x, \mathbf{f}, \mathbf{c}),
\]

where $\mathbf{f} = (f_1, \dots, f_N)$ and $\mathbf{c} = (c_1, \dots, c_N)$. The set of detections
at a given time $t$ is denoted as $\mathcal{D}^t$. As introduced earlier, the proposed algorithm
adopts a graph-based formalism. We define a graph
$\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{W})$ by:

• a set of nodes, $\mathcal{V}$, with each node corresponding to a tracklet,

• a set of edges, $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$, defining the connectivity
between the nodes in $\mathcal{V}$,

• and a set of weights, $\mathcal{W} : \mathcal{E} \to \mathbb{R}^+$, weighting these nodes and
edges.


Initially, individual detections define the nodes of the graph. Detections are then aggregated
into tracklets, which define the nodes of the updated graph. The proposed iterative aggregation
process is presented in detail in Section 4.2, including the definition of costs and edges between
nodes. Here, we only introduce the associated terminology. Formally, a tracklet $v$ is defined to
be a collection of chained detections, i.e., $v = (d^1, d^2, \dots, d^{|v|})$, $|v|$ being the
length of the tracklet.

Notice that the chain is ordered in the sense that the detection times $t_{d^i}$,
$i \in [1, |v|]$, are such that $t_v^{s} = t_{d^1} < t_{d^2} < \dots < t_{d^{|v|}} = t_v^{e}$, with
$t_v^{s}$ and $t_v^{e}$ respectively denoting the starting and ending time of the tracklet.
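The detection and tracklet definitions above map naturally onto small data structures. The following Python sketch is one possible representation, not the layout of the reference implementation; the class and attribute names are hypothetical, and each appearance feature is assumed to be a scalar for simplicity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    t: int                   # detection time
    x: Tuple[float, float]   # location
    f: Tuple[float, ...]     # N appearance features (scalars here, by assumption)
    c: Tuple[float, ...]     # N confidences in [0, 1]; 0 stands for a missing feature


@dataclass
class Tracklet:
    detections: List[Detection]

    def __post_init__(self):
        # The chain must be non-empty and strictly ordered in time.
        times = [d.t for d in self.detections]
        if not times or any(a >= b for a, b in zip(times, times[1:])):
            raise ValueError("detections must be non-empty and strictly time-ordered")

    def __len__(self):
        return len(self.detections)

    @property
    def start_time(self):
        return self.detections[0].t

    @property
    def end_time(self):
        return self.detections[-1].t


d0 = Detection(t=3, x=(0.0, 0.0), f=(0.5,), c=(1.0,))
d1 = Detection(t=5, x=(1.0, 0.0), f=(0.0,), c=(0.0,))  # feature missing at t=5
tracklet = Tracklet([d0, d1])
print(len(tracklet), tracklet.start_time, tracklet.end_time)  # 2 3 5
```

Enforcing the time ordering at construction keeps every later computation (edge creation, window selection) free of ordering checks.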

Notice that pairs of tracklets are connected only between their extremities, in a way that
maintains the increasing ordering of the detection times composing the two tracklets. The weight
$w_{uv}$ is introduced to denote the linking cost between two nodes $u, v \in \mathcal{V}$. It is
formally defined in Section 4.1. In short, it typically decreases with the likelihood that the
nodes $u$ and $v$ correspond to the same physical target. In addition, we introduce the inner cost
$w_v$^4 of a node $v$ to denote the cost of traversing tracklet $v$ from its starting time to its
ending time. It is introduced to prevent long nodes from creating short-cuts in the graph. Since
the edges are directed and “time-forwarded” (see Section 4.1), the graph $\mathcal{G}$ is directed
and acyclic (DAG), and permits only causal traversals. Nevertheless, the graph can be globally
reversed in order to allow anti-causal paths for processing purposes. We denote such a reversed
graph as $\mathcal{G}'$.

In the sequel, we use two more graph notations. First, $\mathcal{G}_\delta$ represents a
windowed-graph formed by selecting in $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{W})$ the
tracklets $v \in \mathcal{V}$ having at least one extreme time component inside the temporal window
$\delta$. The connectivity $\mathcal{E}$ and the weights $\mathcal{W}$ are restricted accordingly
to these selected tracklets in order to form $\mathcal{E}_\delta$ and $\mathcal{W}_\delta$
respectively. Second, in case of incremental tracking, the algorithm incorporates new detections at
each time instant $t$ and the graph is continuously incremented with time. We denote the graph at
time $t$ by $\mathcal{G}^t$. The corresponding vertices and edges are denoted by $\mathcal{V}^t$
and $\mathcal{E}^t$ respectively.

Figure 2 depicts how the tracklets are gathered into a graph in the proposed framework.


4. Iterative hypothesis testing algorithm 

This section first explains the construction of the graph. Afterwards, it presents our proposed
algorithm, and outlines its characteristics.


^4 Note that $w_v$ is not the self-loop of $v$.

Figure 2: Graph formalism for iterative hypothesis testing. The $k$-th detection at time $t$ is
denoted by $d^t_k$. Detections are aggregated into tracklets. Each node corresponds to a tracklet.
An edge that connects two nodes $u$ and $v$ has a cost $w_{uv}$. The windowed-graph
$\mathcal{G}_\delta$ is comprised of the nodes and (blue) edges within the observation window
$\delta$.

4.1. Graph construction 

As introduced earlier, our nodes correspond to tracklets. We create a directed edge from $u$ to
$v$ only if $0 < t_v^{s} - t_u^{e} < T_{max}$, i.e., node $v$ starts after $u$ ends and the time
interval between them is smaller than $T_{max}$.

The weight of the edge is defined solely by the spatio-temporal displacement between $u$ and
$v$, i.e.,

\[
w_{uv} :=
\begin{cases}
\left[ 1 + \gamma \cdot (t_v^{s} - t_u^{e} - 1) \right] g_{sp}(u, v) & \text{if } 0 < t_v^{s} - t_u^{e} < T_{max}, \\
\infty & \text{otherwise,}
\end{cases}
\tag{4}
\]

where the factor $\gamma > 0$ introduces a penalty for missed detections, and $g_{sp}(u, v)$
measures the distance between $v$ and the predicted position of the object corresponding to node
$u$. It is defined as

\[
g_{sp}(u, v) := \left\| x_v^{s} - x_u^{e} - \dot{x}_u^{e} \, (t_v^{s} - t_u^{e}) \right\|, \tag{5}
\]

where $x_v^{s}$ and $x_u^{e}$ are the locations at the start of $v$ and at the end of $u$, and the
term $\dot{x}_u^{e}$ is the velocity at the end of tracklet $u$. It is zero for unit length
tracklets, and is computed from the last 2 detections of the tracklet otherwise. Since the edges
are directed and “time-forwarded”, the graph $\mathcal{G}$ is directed and acyclic (DAG).
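The edge weight described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the reference implementation: the distance is taken to be Euclidean, the prediction is a constant-velocity extrapolation from the end of tracklet u, and the missed-detection factor is read as 1 + γ(Δt − 1), with Δt the time gap between the end of u and the start of v. All function and parameter names are hypothetical.

```python
import math


def end_velocity(positions, times):
    """Velocity at the end of a tracklet: zero for a unit-length tracklet,
    otherwise estimated from its last two detections."""
    if len(positions) < 2:
        return tuple(0.0 for _ in positions[-1])
    dt = times[-1] - times[-2]
    return tuple((b - a) / dt for a, b in zip(positions[-2], positions[-1]))


def edge_weight(x_end_u, vel_u, t_end_u, x_start_v, t_start_v, gamma, t_max):
    """Linking cost from tracklet u to tracklet v (assumed reading of the text)."""
    dt = t_start_v - t_end_u
    if not (0 < dt < t_max):
        return math.inf  # v does not start after u, or the gap is too long
    predicted = tuple(p + v * dt for p, v in zip(x_end_u, vel_u))
    distance = math.dist(predicted, x_start_v)  # Euclidean norm, by assumption
    return (1 + gamma * (dt - 1)) * distance


vel = end_velocity([(0.0, 0.0), (2.0, 0.0)], [0, 2])  # (1.0, 0.0)
# u ends at (2, 0) at t=2; v starts at (4, 3) at t=4: predicted (4, 0), distance 3,
# one missed frame, so the weight is (1 + 0.5 * 1) * 3 = 4.5.
print(edge_weight((2.0, 0.0), vel, 2, (4.0, 3.0), 4, gamma=0.5, t_max=10))
```

With a gap of exactly one frame the penalty factor equals 1, so the weight reduces to the prediction error alone.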

4.2. Iterative hypothesis testing 

Our major objective is to design a detection aggregation method that is able to exploit
appearance cues even when they are sporadic or have variable reliability. Therefore, we promote a
novel paradigm, founded on an iterative hypothesis testing process.

Overview of the contribution. In this approach, each iteration selects a node, named key-node,
and computes the shortest-path to connect this key-node to the extremity of either a forward
or backward neighborhood, under the assumption that the observed key-node appearance defines
the reference appearance of the tracked object. Given this hypothesis, paths that go through
nodes that do (not) share the key-node appearance are promoted (penalized). This is done simply
by decreasing (increasing) the cost to go through a node of the graph when the appearance of that
node is similar (different) to that of the key-node. Hence, all appearance cues, even the sparse or
inaccurate/unreliable ones, can be exploited to drive the selection of aggregated paths within the
graph. Since the process is repeated with each node being the key-node, all observed appearance
hypotheses are examined.

Two subtle mechanisms largely contribute to the success of our approach:

• The first and primary one lies in the conservativeness adopted to turn the key-node
shortest-path into a single tracklet node for subsequent iterations. Actually, this path is only
validated if it is sufficiently better than alternative paths. Importantly, the notion of
‘sufficiently good’, which is formally defined in Section 4.2.2 below, is progressively relaxed
along the iterative process. This makes the overall algorithm greedy, in the sense that the less
ambiguous paths are validated first, thereby making the solution reasonably independent of
the order in which nodes are scheduled as key-nodes and appearance hypotheses are tested;

• The second one consists in defining the size of the key-node neighborhood proportionally
to the length of the key-node. This makes the aggregation multi-scale, which benefits
both the accuracy and the computational efficiency, since the individual detections get the
opportunity to be aggregated into tracklets before investigating large time horizons, leading
to fewer nodes and more accurate appearance estimation on large time frames. See Section 4.2.1.

The global flow of our proposed iterative aggregation algorithm is presented in Algorithm 1.


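Algorithm 1 Iterative Hypothesis Testing
Require: Graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{W})$, number of iterations MAX_ITER
Ensure: Updated graph $\mathcal{G}$ after MAX_ITER iterations
Procedure:
  dir ← +1
  Initialize $K_1$ and $K_2$
  for l = 1, ..., MAX_ITER do
    Initialize: $\mathcal{R} \leftarrow \mathcal{V}$
    while $\mathcal{R} \neq \emptyset$ do
      $v_{key}$ ← Schedule($\mathcal{R}$)
      $v_{agg}$ ← HypothesisTesting($\mathcal{G}$, $v_{key}$, dir, $K_1$, $K_2$)
      if $v_{agg} \neq v_{key}$ then
        $\mathcal{G}$ ← Simplify($\mathcal{G}$, $v_{agg}$)
      end if
      $\mathcal{R} \leftarrow \mathcal{R} \setminus v_{agg}$
    end while
    ($K_1$, $K_2$) ← Relax($K_1$, $K_2$)
    dir ← −dir
  end for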


As controlled by the dir flag, the direction of investigation changes at each graph-scanning
iteration to propagate the key-node appearance hypothesis both forward and backward, thereby
making the global process symmetric with respect to time.

In Algorithm 1, the function Schedule selects a node for hypothesis testing that has not yet
been scheduled during the on-going scanning of the graph. In this paper, we select the nodes in
decreasing order of their lengths because long nodes are more likely to have accumulated reliable
appearance information. Our experimental results have shown that the node scheduling strategy
does not affect the performance much.
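The outer loop of Algorithm 1 can be transcribed into a short Python skeleton. This is only a sketch of the control flow: the graph is assumed to be a dict mapping a node id to its tuple of detections, and `hypothesis_testing`, `simplify` and `relax` are caller-supplied placeholders (hypothetical signatures) standing for the corresponding steps of the algorithm.

```python
def run_iht(graph, hypothesis_testing, simplify, relax, k1, k2, max_iter):
    """Skeleton of the iterative loop.

    graph: dict node_id -> tuple of detections (assumed representation).
    hypothesis_testing(graph, v_key, direction, k1, k2) must return the list of
    node ids to aggregate, starting with v_key ([v_key] if nothing is validated).
    simplify(graph, path) must return the graph with the path merged into one node.
    relax(k1, k2) must return the relaxed validation parameters.
    """
    direction = +1
    for _ in range(max_iter):
        remaining = set(graph)
        while remaining:
            # Schedule: longest tracklet first (ties broken by smallest id).
            v_key = max(sorted(remaining), key=lambda n: len(graph[n]))
            v_agg = hypothesis_testing(graph, v_key, direction, k1, k2)
            if v_agg != [v_key]:
                graph = simplify(graph, v_agg)
            remaining -= set(v_agg)
            remaining.discard(v_key)  # guarantees progress even with odd callbacks
        k1, k2 = relax(k1, k2)
        direction = -direction
    return graph


# Stub callbacks: merge node 0 with node 1 whenever both exist.
def merge_first_two(graph, v_key, direction, k1, k2):
    return [0, 1] if v_key == 0 and 1 in graph else [v_key]


def merge_path(graph, path):
    merged = tuple(d for n in path for d in graph[n])
    new_graph = {n: dets for n, dets in graph.items() if n not in path}
    new_graph[path[0]] = merged
    return new_graph


print(run_iht({0: ("a",), 1: ("b",)}, merge_first_two, merge_path,
              lambda a, b: (a, b), k1=1.0, k2=0.5, max_iter=1))  # {0: ('a', 'b')}
```

Passing the three steps as callables keeps the skeleton independent of any particular graph library or validation rule.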

The remainder of this section details the practical implementation of the core of our proposed
HypothesisTesting strategy. It is detailed in Algorithm 2, and involves both (i) the computation
of the shortest-path connecting the key-node to its neighborhood, under the target appearance
hypothesis, and (ii) the validation or rejection of this path as a tracklet for subsequent
iterations of Algorithm 1.

4.2.1. Hypothesis: multi-scale tracklet aggregation 

Formally, the key-node is denoted $v_{key}$. It is selected among the set of nodes,
$\mathcal{R}$, that have not yet been investigated during the current scanning of the graph. The
aggregation of the key-node with its neighbors is then investigated in an observation window that
precedes or follows the key-node, depending on the sign of the dir flag. The size of the
observation window is proportional to the length of the key-node. We use $\delta$ to denote the
observation window interval and $|\delta|$ to denote its size. Hence,
$\delta = [t_{key}^{e}, t_{key}^{e} + \sigma \cdot |v_{key}|]$ in the forward mode (dir = 1), or
$\delta = [t_{key}^{s} - \sigma \cdot |v_{key}|, t_{key}^{s}]$ in the backward mode (dir = -1),
where $\sigma \in \mathbb{R}^+$ is the window proportionality constant.
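As a small illustration of the multi-scale window, the sketch below computes the window limits from the key-node extent. It assumes that the forward window starts at the end time of the key-node and the backward window ends at its start time; the function name and the `scale` parameter (standing for the window proportionality constant) are hypothetical.

```python
def observation_window(t_start, t_end, length, direction, scale):
    """Return (lower, upper) limits of the observation window of a key-node.

    t_start, t_end: start and end times of the key-node tracklet.
    length: number of detections aggregated into the key-node.
    direction: +1 for the forward mode, -1 for the backward mode.
    scale: window proportionality constant (positive).
    """
    size = scale * length
    if direction == 1:
        return (t_end, t_end + size)
    if direction == -1:
        return (t_start - size, t_start)
    raise ValueError("direction must be +1 or -1")


# A key-node of 5 detections spanning frames 10..14, with a constant of 2:
print(observation_window(10, 14, 5, +1, 2.0))  # (14, 24.0)
print(observation_window(10, 14, 5, -1, 2.0))  # (0.0, 10)
```

A single detection therefore only looks a couple of frames ahead, while a long tracklet is allowed to bridge much longer gaps.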


Algorithm 2 HypothesisTesting
Require: Graph $\mathcal{G}$, key-node $v_{key}$, direction flag dir, validation parameters $K_1$, $K_2$
Ensure: Nodes that can be aggregated $v_{agg}$
Procedure:
  $\delta$ ← Limits of observation window {See text}
  $\mathcal{G}_\delta$ ← GraphHypothesis($\mathcal{G}$, $\delta$, $v_{key}$)
  ($S_b$, $S_{sb}$) ← Shortest- and second shortest-paths from $v_{key}$
  if isUnambiguous($S_b$, $S_{sb}$) then {Refer to Figure 3 for illustration}
    $\mathcal{G}'_\delta$ ← ReverseDirection($\mathcal{G}_\delta$)
    ($S_b'$, $S_{sb}'$) ← Shortest- and second shortest-paths from $v^*$
    if isUnambiguous($S_b'$, $S_{sb}'$) then
      $v_{agg}$ ← $S_b$
    end if
  else
    $v_{agg}$ ← $v_{key}$
  end if
  return $v_{agg}$

isUnambiguous($S_b$, $S_{sb}$)
  return cost($S_b$) < $K_1 \cdot |\delta|$ and cost($S_b$)/cost($S_{sb}$) < $K_2$
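The validation test at the end of Algorithm 2 can be sketched as a small Python function. The sketch assumes non-negative path costs and rewrites the ratio test as a product to avoid a division by zero when the second-best path has zero cost; a missing second-best path is represented by an infinite cost. The function and parameter names are hypothetical.

```python
import math


def is_unambiguous(cost_best, cost_second, window_size, k1, k2):
    """Validate the best path if it is cheap enough for the window size
    and sufficiently cheaper than the second-best path.

    Assumes non-negative costs; cost_second may be math.inf when no
    alternative path exists.
    """
    cheap_enough = cost_best < k1 * window_size
    # Equivalent to cost_best / cost_second < k2 for positive cost_second.
    clearly_better = cost_best < k2 * cost_second
    return cheap_enough and clearly_better


print(is_unambiguous(2.0, 10.0, window_size=5, k1=1.0, k2=0.5))      # True
print(is_unambiguous(2.0, 3.0, window_size=5, k1=1.0, k2=0.5))       # False: too close
print(is_unambiguous(6.0, 100.0, window_size=5, k1=1.0, k2=0.5))     # False: too costly
print(is_unambiguous(2.0, math.inf, window_size=5, k1=1.0, k2=0.5))  # True: no rival
```

Increasing `k1` and `k2` between graph scans corresponds to the progressive relaxation of the validation criterion.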


Given the key-node $v_{key}$ and the observation window $\delta$, we define a graph
$\mathcal{G}_\delta$ to investigate how the key-node can be aggregated with its neighbors to define
an appearance-consistent path under the assumption that the target appearance is defined by the
key-node appearance. The graph $\mathcal{G}_\delta$ is directly derived from the graph
$\mathcal{G}$, by cutting $\mathcal{G}$ according to the limits of the observation window, and
updating the inner costs of the nodes within the window to reflect the hypothesis made about the
target appearance. In short, the inner cost $w_v$ of a node $v \in \mathcal{V}_\delta$ is increased
(decreased) if it has a different (similar) appearance than the one of the key-node.
In more detail, the inner cost is updated as follows. First, an appearance is associated to the
tracklet $v$. The inference of the tracklet appearance from its individual detections appearances
directly depends on the characteristics of the appearance observation process. If, for example,
the observation process is affected by outliers, a RANSAC [14] approach would be appropriate
to capture the right tracklet appearance. In contrast, if the observations are independent and
affected by Gaussian noise, then a weighted average provides an appropriate inference mechanism.
Here, we use a weighted average for the tracklet appearance as an example of possible practical
implementation. Then, the average feature of a node $v$ is computed as


/ ^ ^ (v) Ay) 

■’ ' (-(v) Zj •' ’ 


f=l 


(6) 


where ■ 

—(key) —(y) 

Given the key-node and tracklet v appearances and respectively, let D(y) denote the 
value by which the inner cost of node v is incremented due to its dissimilarity with respect to the 
key-node appearance. We define 






(7) 


i=\ 


where T, weights the contribution of the /-th feature. The parameter is introduced such 
that it tends to one (zero) for confident (unreliable) features. As an example of practical imple¬ 
mentation, we define it as; 


'0 ifCf^<C™i„, 

1 ifCf'>C™ax, (8) 

C^'’^-Cni-n 

' — otherwise. 

^max ^min 

where Cmin and Cm^are the limits to define if the feature is considered reliable or not. 

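From Equation (7), when $\alpha_i^{(key)} \alpha_i^{(v)} \to 1$, the $i$-th term of $D(v)$ tends to $\mu_i \| \bar{f}_i^{(key)} - \bar{f}_i^{(v)} \|$, and when $\alpha_i^{(key)} \alpha_i^{(v)} \to 0$,
it tends to $w_i^{(fix)}$. The term $w_i^{(fix)}$ is introduced so that a node that definitely looks similar to the
key-node ($D(v) \approx 0$) is favored compared to a node for which no appearance feature is available
($D(v) \approx \sum_i w_i^{(fix)}$). Empirically, we set $w_i^{(fix)} = 5$ for all $1 \le i \le N$.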
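
As an illustration, the reliability weight of Equation (8) and the resulting inner-cost increment can be sketched in Python as follows. This is a minimal sketch, not the reference implementation: the function and argument names are hypothetical, features are taken to be scalars, and the form of the increment (a reliability-weighted distance blended with a fixed cost) is an assumption based on the two limiting cases discussed above.

```python
def reliability(confidence, c_min, c_max):
    """Piecewise-linear reliability weight in [0, 1], following Equation (8)."""
    if confidence < c_min:
        return 0.0
    if confidence > c_max:
        return 1.0
    return (confidence - c_min) / (c_max - c_min)


def appearance_increment(f_key, c_key, f_v, c_v, c_min, c_max, weights, fixed_costs):
    """Assumed form of D(v): per feature, blend a weighted distance and a fixed cost.

    f_key, f_v: per-feature appearance values of the key-node and of node v.
    c_key, c_v: per-feature confidences of the key-node and of node v.
    weights, fixed_costs: per-feature weights and fixed costs (hypothetical names).
    """
    total = 0.0
    for i in range(len(f_key)):
        alpha = reliability(c_key[i], c_min, c_max) * reliability(c_v[i], c_min, c_max)
        distance = abs(f_key[i] - f_v[i])  # scalar features, for simplicity
        total += alpha * weights[i] * distance + (1.0 - alpha) * fixed_costs[i]
    return total
```

With both features fully reliable and identical, the increment vanishes; when either node carries no reliable appearance, the increment falls back to the fixed cost, so a confirmed match is always preferred over an unknown one.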

After the inner costs of the nodes have been incremented by $D(v)$, the shortest path $S_b$
connecting the key-node to the extremity of the observation window is computed. Thanks to the
directed and acyclic nature of the graph, the shortest-path computation can exploit the inherent
topological ordering of the nodes (e.g., to support a depth-first search), which is more efficient
than Dijkstra's algorithm. The cost of a path is defined as the sum of the costs of the edges and
the inner costs of the nodes along it, and is given by the function cost in Algorithm 1.

Although updating the costs seems to require an additional scan of the graph, this overhead is
mitigated by the concept of visitors in the shortest-path algorithms of the Boost Graph Library.
Visitors allow the costs of the nodes or edges to be updated on the fly, as they are manipulated
during the shortest-path computation.
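
The idea can be sketched in a few lines of Python. The sketch below is not the Boost Graph Library API; it is a hypothetical, self-contained illustration that relaxes the nodes in topological order (linear in the number of nodes and edges) and applies an optional `bump` callback when a node is reached, which plays the role of the visitor that updates inner costs on the fly. All names are assumptions made for the example.

```python
import math


def dag_shortest_paths(order, succ, inner_cost, source, bump=None):
    """Single-source shortest paths on a DAG.

    order: nodes in topological order (e.g., sorted by time).
    succ: dict mapping a node to a list of (neighbor, edge_cost) pairs.
    inner_cost: dict mapping a node to its inner cost.
    bump: optional callable returning an extra cost for a node, applied
          when the node is reached (stand-in for an on-the-fly visitor).
    """
    bump = bump or (lambda v: 0.0)
    dist = {v: math.inf for v in order}
    pred = {v: None for v in order}
    dist[source] = inner_cost[source] + bump(source)
    for u in order:
        if dist[u] == math.inf:
            continue  # u is not reachable from the source
        for v, edge_cost in succ.get(u, []):
            candidate = dist[u] + edge_cost + inner_cost[v] + bump(v)
            if candidate < dist[v]:
                dist[v] = candidate
                pred[v] = u
    return dist, pred


def path_to(pred, target):
    """Rebuild the path ending at target from the predecessor map."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]
```

Passing a `bump` function that penalizes nodes whose appearance differs from the key-node changes the selected path without a separate pass over the graph.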

4.2.2. Testing: path ambiguity estimation and tracklet validation

Since the costs of the edges have been defined to take the displacement as well as the appearance
into consideration, the shortest path $S_b$, which connects the key-node to the extremity of
the observation window, reasonably corresponds to a single physical object (same appearance,
and consistent displacements) and could thus be aggregated into a single node.

However, to limit the risk of connecting nodes that correspond to two distinct objects, we
check the level of ambiguity of the shortest path by comparing its cost to the costs of alternative
paths. Figure 3 illustrates this process.


Figure 3: Illustration of the validation of the hypothesis. Within the window, the best (thick arrow) and the second
best (thin arrow) paths (denoted by $S_b$ and $S_{sb}$ respectively) are searched. Blue and red arrows represent forward and
backward directions respectively. Best viewed in color.

It runs in two steps. In the first step, the shortest $S_b$ and the second shortest $S_{sb}$ paths^ are
considered. The ends of the best and second-best paths are denoted as $v_b$ and $v_{sb}$ respectively. The
shortest path $S_b$ is considered unambiguous only if two conditions are met: (i) $\mathrm{cost}(S_b) < K_1 \cdot |\delta|$,
and (ii) $\mathrm{cost}(S_b) / \mathrm{cost}(S_{sb}) < K_2$.

If both conditions are met, the second step of the validation process is carried out. For this,
the graph is reversed by flipping the direction of all the edges of $\mathcal{G}_\delta$. This operation is referred to as
ReverseDirection in the algorithm. The shortest ($S_{b'}$) and second shortest ($S_{sb'}$) paths linking $v_b$
with the opposite extremity of the observation window are then computed. If $S_{b'}$ leads to the
original key-node, i.e., if $v_{b'} = v_{key}$, and if a similar set of conditions holds for $S_{b'}$ and $S_{sb'}$, then
the path $S_b$ is considered to be unambiguous, and is replaced by a single node in the graph for
subsequent iterations of the IHT. This procedure is called Simplify in Algorithm 1. It updates
the appearance features of the node as in Equation (6), and also the motion parameters. It keeps
only the edges connecting the extremities of the aggregated path to the rest of the graph. Other
connections involving intermediate nodes are removed.
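
The two-step test can be summarized by the following Python sketch. It is only an illustration under stated assumptions: the dictionaries holding the best cost, second-best cost and end node of each search are a hypothetical data layout, the same thresholds are reused for the backward pass, and a strictly positive second-best cost is assumed.

```python
def passes_cost_tests(cost_best, cost_second, window_size, k1, k2):
    """Conditions (i) and (ii): low absolute cost and clear margin over the runner-up."""
    low_cost = cost_best < k1 * window_size
    clear_margin = cost_second > 0 and cost_best / cost_second < k2
    return low_cost and clear_margin


def is_unambiguous(forward, backward, key_node, window_size, k1, k2):
    """Forward/backward validation of the shortest path.

    forward, backward: dicts with keys 'best_cost', 'second_cost' and 'end'.
    backward is assumed to be computed on the reversed graph, starting
    from forward['end'].
    """
    if not passes_cost_tests(forward["best_cost"], forward["second_cost"],
                             window_size, k1, k2):
        return False
    if backward["end"] != key_node:
        return False  # the backward search does not return to the key-node
    return passes_cost_tests(backward["best_cost"], backward["second_cost"],
                             window_size, k1, k2)
```

Only paths for which this function returns True would be aggregated into a single node; all others are left untouched for later iterations.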

Choosing small (large) values of $K_1$ and $K_2$ makes the constraint more (less) conservative.
In the first iterations of the algorithm, we start with small values of $K_1$ and $K_2$. As the iterations
proceed, we progressively relax the validation criteria. This makes the overall IHT algorithm
greedy, in the sense that the less ambiguous paths are validated first, thereby making the solution
reasonably independent of the order in which nodes are scheduled as key-nodes and appearance
hypotheses are tested. This progressive relaxation of the key-node path validation constraint is
denoted by the function Relax in Algorithm 1. An example of a relaxation scheme is described in
the results section.

^In our implementation, the second-best path $S_{sb}$ is chosen to be a path that does not overlap the shortest path
$S_b$. When there exist paths with smaller costs that partly overlap with $S_b$, the unshared part of $S_b$ remains subject to
ambiguity, even if the cost of $S_{sb}$ is large. Hence, only the part of $S_b$ that is shared with those alternative paths should be considered
for aggregation. For details, refer to the reference software at http://sites.uclouvain.be/ispgroup/index.php/Softwares/HomePage

5. From off-line to incremental IHT 

Because we iterate over the nodes, our IHT naturally extends to incremental scenarios in
which the detections arrive sequentially over time. Compared to the off-line approach, there are,
however, a few subtleties:

• Incrementing the graph: At time $t = 1$, the graph is just the set of detections at that instant,
i.e., $\mathcal{G}^1 = (\mathcal{D}^1, \emptyset)$. At time $t > 1$, the graph is obtained by adding the new detections $\mathcal{D}^t$ to
the so-called previous graph $\mathcal{G}^{t-1}$ resulting from earlier steps of the algorithm, up to time
$t - 1$. All nodes ending later than time $t - \tau_{max}$ are linked to all the current detections. The
weight of each edge is computed as in the edge-cost equation of Section 4.1.

• Scheduling of the nodes: Unlike the off-line approach, we schedule the ‘recent’ nodes
first. This is done to prevent the fast growth of the graph at each time. Specifically,
we schedule the nodes in decreasing order of $|v| / \max\{1, t - t_v^{end}\}$, where $t_v^{end}$ is the time at which node $v$ ends, so that the ‘recent’ and
‘sufficiently long’ nodes are selected first.

• Relaxing the validation criteria: We maintain a ‘sliding window’ $[t - \delta_{slide}, t]$, where $\delta_{slide}$
is the length of the sliding window. Inside (respectively, outside) the sliding window, we
impose conservative (respectively, relaxed) criteria for $K_1$ and $K_2$. We use $\delta_{slide} = 200$
frames.
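
The scheduling rule can be illustrated with a short Python sketch. It assumes that the quantity subtracted from the current time in the priority is the ending time of the node, which is our reading of the rule given its stated purpose (favoring ‘recent’ and ‘sufficiently long’ nodes); the tuple layout and function name are hypothetical.

```python
def schedule(nodes, t):
    """Order nodes by decreasing priority |v| / max(1, t - end_time).

    nodes: list of (node_id, length, end_time) tuples, where length is the
    number of aggregated detections and end_time the last time instant
    covered by the node (assumed interpretation).
    """
    return sorted(nodes, key=lambda n: n[1] / max(1, t - n[2]), reverse=True)
```

For instance, at time 100, a node of length 10 ending at time 99 is scheduled before a node of length 2 ending at time 100, which itself comes before a node of length 10 ending at time 90.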

6. Evaluation 

We test our proposed IHT algorithm on a toy example and also on the real-life APIDIS [15],
PETS [16] and TUD [17] datasets. The toy example helps us to highlight the various steps of our
IHT aggregation paradigm, while the experiments on real-life examples demonstrate the practical
relevance of our approach.

The proposed approach has been implemented in C++ (for the APIDIS dataset) and MATLAB^
(for the toy example and the PETS dataset). The C++ implementation uses the Boost Graph Library
to represent the graph. The DAG shortest-path algorithm is provided by the library. All
experiments are performed on a desktop computer with a 3 GHz quad-core CPU and 4 GB of RAM,
running Linux.


^The MATLAB implementation is available at http://sites.uclouvain.be/ispgroup/index.php/Softwares/HomePage

6.1. Evaluation metrics

We use the CLEAR MOT metrics [18] to evaluate our approach. They define two quantities,
namely the multiple object tracking precision (MOTP) and the multiple object tracking accuracy
(MOTA).

MOTP is defined as the average error in the estimated position over all matched pairs of ground-truth
and estimated tracks. MOTA is defined to decrease proportionally to the number of missed
detections, false positives, re-initializations and switches (see [18] for the formal definition). The
error due to switches is usually more problematic since it affects the higher-level interpretation
of the tracks.

Since MOTP depends on the accuracy of the target detector and on the accuracy of the ground truth,
MOTA is often preferred over MOTP. Since switching errors are important, we also
report the overall number of switching errors (SW).


6.2. Datasets 

Toy dataset. This dataset is used to observe how our algorithm compares to related
works, and also to assess its sensitivity to parameter selection. We consider 3 targets whose
ground-truth locations $(y_1, y_2, y_3)$ at time instances $k \in \{0, \cdots, 10\}$ are obtained by

$$y_1 = 50 \sin(\ldots), \quad y_2 = 50 \cos(\ldots), \quad y_3 = -20 - 50 \sin(\ldots). \qquad (9)$$


The appearance feature of the $i$-th target, denoted as $f_i$, is modeled by a 2-state automaton,
as shown in Figure 4. For the $i$-th target, the appearances in states 1 and 2 are modelled as
$\mathcal{N}(\mu_i, \sigma_{low})$ and $\mathcal{N}(\mu_i, \sigma_{high})$ respectively. We use $\mu_i \in \{0, 120, 240\}$, $\sigma_{low} = 10$ and $\sigma_{high} = 100$.
We fix $q = 0.5$ and vary $p$. The confidence of the feature measurement process is estimated
as $C_i = 0.1$ if $s = 2$, and $0.8$ if $s = 1$.

Figure 4: 2-state automaton for modelling the appearance of the $i$-th target.
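
To make the appearance model concrete, the following Python sketch draws a sequence of (feature, confidence) pairs for one target. It is a hypothetical illustration: in particular, it assumes that $p$ is the probability of leaving state 1 (reliable) and $q$ the probability of leaving state 2 (unreliable), and that the target starts in state 1. The function name and signature are not taken from the reference software.

```python
import random


def simulate_appearance(mu, p, q, steps, rng, sigma_low=10.0, sigma_high=100.0):
    """Return a list of (feature, confidence) pairs for one target.

    Assumption: p is the probability of switching from state 1 to state 2,
    and q the probability of switching from state 2 back to state 1.
    """
    state = 1
    samples = []
    for _ in range(steps):
        if state == 1:
            samples.append((rng.gauss(mu, sigma_low), 0.8))   # reliable measurement
            if rng.random() < p:
                state = 2
        else:
            samples.append((rng.gauss(mu, sigma_high), 0.1))  # unreliable measurement
            if rng.random() < q:
                state = 1
    return samples
```

A call such as `simulate_appearance(120, 0.3, 0.5, 11, random.Random(0))` produces one realization for the second target; larger values of `p` yield more frequent unreliable measurements.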

APIDIS dataset. This 15 minutes long basketball video dataset has been captured 
by 7 cameras. The candidate detections are computed at each time instant based on a ground 
occupancy map, as described in ina. For each detection, jersey color and digit are considered 
appearance features. The jersey color is computed as the average blue component divided by the 
sum of average red and green components, over the foreground silhouette of the player within the 
detected rectangular box. The digit feature is obtained by running a digit-recognition algorithm 
im in the same rectangular region. The digit feature is inherently sporadic as it is available only 
when the digit faces the camera. 

Pedestrian datasets. We use the publicly available PETS S2/L1 and TUD Stadtmitte datasets
to evaluate the performance of our approach on monocular views. PETS S2/L1 is a 795 frames
long dataset with moderate target density. TUD Stadtmitte is 179 frames long. Because of the
low viewpoint, the targets frequently occlude each other. Detection results are obtained from
[20]. For each detection, we compute a 24-bin color histogram by concatenating the 8-bin histograms
of the R, G and B channels^. We ignore the color histogram if the overlap ratio between two bounding boxes
exceeds 10%. This is done because the histograms are likely to be unreliable in the presence of occlusions.
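
A possible way to build such a descriptor is sketched below in plain Python. The normalization of the histogram and the helper names are assumptions made for the example; only the 3 x 8 bin layout and the 10% overlap threshold come from the description above.

```python
def rgb_histogram(pixels, bins_per_channel=8):
    """Concatenate per-channel histograms of 8-bit (r, g, b) pixels.

    Returns a list of length 3 * bins_per_channel in which each channel's
    bins sum to one (normalization is an assumption of this sketch).
    """
    hist = [0.0] * (3 * bins_per_channel)
    width = 256 // bins_per_channel
    count = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            bin_index = min(value // width, bins_per_channel - 1)
            hist[channel * bins_per_channel + bin_index] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


def histogram_is_usable(overlap_ratio, threshold=0.10):
    """Discard the histogram when the bounding boxes overlap by more than 10%."""
    return overlap_ratio <= threshold
```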


6.3. Results on the toy example 

For the toy example, the nodes of the graph $\mathcal{G}$ correspond to the detections defined by Equation (9).
For IHT, KSP and GAC, we create edges between the nodes that occur at consecutive
time instants. The cost $w_{ij}$ combines $w_{ij}^{(s)}$ and $w_{ij}^{(a)}$, the spatio-temporal
and appearance costs respectively. The spatio-temporal cost is defined as $w_{ij}^{(s)} := \|y_j - y_i\|_2$. The
appearance cost differs from one algorithm to another. Given two appearance features $f_i$ and $f_j$,
the appearance dissimilarity $d_{ij}$ is computed as $d_{ij} := 1 - |\cos(\pi (f_j - f_i)/180)|$. Then, we define
the appearance cost as follows:

Algorithm   Reference appearance                  Appearance cost $w^{(a)}$
KSP         None                                  $C_i C_j d_{ij} + (1 - C_i C_j) \, w^{(fix)}$
GAC         $l$-th global appearance, $f_l$       $C_i d_{il} + (1 - C_i) \, w^{(fix)}$
IHT         $l$-th key-node appearance, $f_l$     $C_i C_l d_{il} + (1 - C_i C_l) \, w^{(fix)}$

where $w^{(fix)} > 0$ is a fixed cost assigned to nodes for which the
appearance is unknown or unreliable. We use $w^{(fix)} = 10$ in our experiments. For GAC, the set of
global appearances is either known a priori (e.g., provided by an oracle), or
estimated from the measurements (e.g., using k-means with 3 clusters in our toy-example case).
For GMCP, we consider the problem in Equationplwith C(7’,) where 

1_| i’#« 

V* = argmin^er, (fy - h,)- We dehne ||y„ - My*||2 if v = v* and 00 otherwise. The 

appearance cost is defined similarly to KSP. As stated in Section 1.2, this GMCP formulation
is NP-complete. Following the authors of GMCP, we have adopted the popular 2-opt local
search to solve it.
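
For concreteness, the toy-example appearance costs of KSP, GAC and IHT can be sketched in Python as follows. The function names are hypothetical; the sketch assumes that the three variants differ only in the reference appearance and in the confidence attached to it (a confidence of 1 for a global appearance in GAC), with the fixed cost set to 10 as in our experiments.

```python
import math


def dissimilarity(f_i, f_j):
    """d_ij = 1 - |cos(pi * (f_j - f_i) / 180)|, with features given in degrees."""
    return 1.0 - abs(math.cos(math.pi * (f_j - f_i) / 180.0))


def appearance_cost(f_i, c_i, f_ref, c_ref=1.0, w_fix=10.0):
    """Confidence-weighted dissimilarity, falling back to a fixed cost.

    KSP: f_ref and c_ref belong to the neighboring node j.
    GAC: f_ref is a global appearance and c_ref is left at 1 (assumption).
    IHT: f_ref and c_ref belong to the key-node.
    """
    weight = c_i * c_ref
    return weight * dissimilarity(f_i, f_ref) + (1.0 - weight) * w_fix
```

When either confidence is zero the cost equals the fixed cost, whereas two fully confident and identical appearances yield a zero cost.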

Results: In our simulations, we vary the transition probability $p$ from 0 to 0.9 in
increments of 0.1. For each value of $p$, we generate 100 realizations of the target appearances
and apply the IHT, GMCP, GAC and KSP algorithms. The MOTA obtained with and without prior
knowledge about the appearance measurement reliability, i.e., with and without knowing the state of
the automaton, is shown in Figure 5.

We observe that taking the confidence of the feature measurement into account helps to disambiguate
the data association. When the confidence information is not taken into account,
IHT, GAC and KSP all perform similarly, with IHT performing slightly better than the other two
algorithms. The performance improves significantly when the confidence measure is incorporated.
Surprisingly, GMCP has the worst performance even though it adopts a close-to-ideal
problem formulation. The inferior performance of GMCP can be attributed to the fact that each
‘track’ is extracted greedily and locally from the set of nodes. Unlike GAC, there is no notion of
global solution. Unlike IHT, it does not challenge the ambiguity of the extracted track.


^In a tracklet, a distinct histogram is associated to each extremity, so as to account for progressive target appearance 
changes. 

Figure 5: Performance of IHT, GAC, KSP and GMCP on the toy example without (first panel) and with (second panel) the feature
measurement confidence information, as a function of $p$. Best viewed in color.


It is worth noting that the performance of GAC is strongly dependent on the prior knowledge
of the 3 global appearances. The performance in Figure 5 indeed appears to degrade significantly
when the 3 appearances are estimated from the measurements (based on k-means clustering).

Our IHT algorithm has two distinct steps: (i) node scheduling, and (ii) hypothesis validation.
To study the importance of these steps, we consider the following set-ups. We schedule the
nodes either at random or in decreasing order of appearance confidence. In addition, we validate
the shortest path either conservatively, as described by Figure 3, or always, meaning that we
systematically define a new tracklet based on the shortest path. The results are presented in
Figure 6. They show that the node scheduling has a negligible impact on the performance of the
IHT algorithm. On the other hand, the conservative validation of the shortest path has a drastic
influence on the performance of IHT. By comparing Figure 5 with Figure 6, we observe that IHT
performs worse than KSP when we validate the shortest path immediately. This is not surprising
because IHT investigates a local section of the graph whereas KSP works globally on the
whole graph.



Figure 6: Effect of scheduling of nodes and validation strategy on the performance of IHT. Best viewed in color. 


6.4. Results on real-life datasets 

In this section, we present and discuss the performance of both the off-line and incremental
IHT. To better compare our method with the literature, we consider two experimental set-ups.
The first one discards the appearance features and uses only the spatio-temporal information. In
contrast, the second one incorporates the appearance features.

Apart from KSP, GAC and GMCP, we compare our results with several other methods, namely
the discriminative label propagation (DLP) [12] (introduced earlier), the continuous energy
(CE) [22] and the discrete-continuous optimization (D-C) [23] trackers. The CE and D-C trackers compute
the most probable tracks by minimizing a combination of energies that reflect the consistency
with the observed detection locations, the plausibility of the track dynamics, the persistence
of tracks, and the exclusivity constraint between co-existing tracks. In addition, D-C uses cubic
splines to model the motion of targets and also penalizes the number of trajectories.

It is to be noted that CE and D-C do not use appearance features. Therefore, we compare
them with the first version of IHT, which does not exploit appearance features. GAC and DLP
are able to exploit sporadic appearance features only. GMCP, on the other hand, can exploit
appearance features that can be sporadic or have variable reliability. Therefore, we compare the
second version of our IHT with these methods.


                                 No appearance           With appearance
Dataset  Method                  MOTA    MOTP    SW      MOTA    MOTP    SW
APIDIS   GAC                     72.91   53.13   108     73.07   53.15   110
         DLP [12]                81.25   57.13   49      83.80   60.01   45
         IHT (offline)           76.71   65.36   11      87.91   64.43   0
         IHT (incremental)       75.99   64.53   14      86.82   65.13   3
TUD      CE [22]                 60.5    65.8    7       -       -       -
         D-C [23]                61.8    63.2    4       -       -       -
         GMCP                    -       -       -       77.7    63.4    0
         DLP [12]                62.6    73.5    17      79.3    73.9    4
         IHT (offline)           62.1    73.2    7       78.5    73.2    0
         IHT (incremental)       61.8    72.9    9       78.3    73.1    1
PETS     CE [22]                 81.84   73.93   15      -       -       -
         D-C [23]                89.30   56.40   -       -       -       -
         GMCP                    -       -       -       90.30   69.02   8
         DLP [12]                82.75   71.21   25      91.01   70.99   5
         GAC                     80.00   58.00   28      81.46   58.38   19
         IHT (offline)           81.18   74.53   9       85.10   74.56   4
         IHT (incremental)       80.91   74.48   11      84.78   74.32   5

Table 1: Tracking results on the APIDIS (1500 frames), PETS (795 frames) and TUD Stadtmitte (179 frames) datasets.


From Table 1, we first observe that the incremental version performs slightly worse than the
off-line version. For the APIDIS dataset, our method outperforms KSP and GAC significantly. Even
though DLP seems to work better than IHT when no appearance features are used, it commits
significantly more switching errors. This illustrates the conservativeness of IHT. When the appearance
features are incorporated, IHT outperforms all other methods.

In the case of the pedestrian datasets, IHT performs comparably to the other methods. Even
though its MOTA scores are similar to or lower than those of GMCP and DLP, its number of switching
errors is equal or lower, which is an advantage in terms of high-level scene interpretation.
In the case of the TUD dataset, IHT is comparable to CE, D-C and DLP in terms of MOTA.
However, in the case of the PETS dataset, IHT performs worse than D-C. The superior performance of
D-C in scenarios for which no appearance feature is exploited can be attributed to the fact that
this approach uses higher-order motion models. On the positive side, our tracker commits fewer
switching errors.

The left side of Figure 7 compares the performance obtained by IHT when different sets of
appearance features are exploited.


Figure 7: Components of MOTA metric on a 1 minute long video of the APIDIS dataset for off-line IHT. (Left.)
Various feature combinations. The MOTA scores for all 4 cases are as follows: (i) no feature: 76.71%, (ii) digit feature
only: 83.03%, (iii) color feature only: 84.84%, and (iv) both features: 87.91%. (Right.) Effect of relaxation of validation criteria, for different $(K_1, K_2)$. Red,
blue and green bars correspond to the ‘least conservative’, ‘most conservative’ and ‘progressively relaxed’ validation criteria
respectively. The MOTA scores for all 3 cases are as follows: (i) least conservative: 87.36%, (ii) most conservative:
78.52%, and (iii) progressively relaxed: 87.91%. Best viewed in color.


As we can see, the switches and re-initializations are reduced substantially when the appearance
features are used. It can also be seen that the digit features, even though they are highly
sparse, can disambiguate some tracks.

In order to study the effect of the progressive relaxation of the validation criteria, we first
fix the values of $K_1$ and $K_2$ such that the validation criteria are ‘most conservative’ (i.e., small
values of $K_1$ and $K_2$) and ‘least conservative’ (i.e., large values of $K_1$ and $K_2$). Specifically, we
set $(K_1, K_2) = (5, 1/4)$ and $(K_1, K_2) = (30, 1/1.1)$ respectively. For progressive relaxation, we then consider a
linear increase in $K_1$ from 5 to 30 in 50 iterations, and a linear increase in $K_2$ from 1/4 to 1/1.1 in
20 iterations. We increase $K_2$ faster than $K_1$ because the primary condition to validate a path is its
low cost. The results are depicted in Figure 7. We see that relaxing the validation criteria indeed
helps to improve the tracking results in the sense that it avoids identity switches (just as for a
highly conservative criterion), while maintaining re-initializations and misses at the level obtained
with a less conservative criterion. Hence, it keeps the best of the two criteria.
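
The relaxation scheme described above can be written as a small Python helper. This is a sketch under assumptions: the function name is hypothetical, and the thresholds are assumed to stay at their final values once the ramps are completed.

```python
def relaxed_thresholds(iteration, k1_range=(5.0, 30.0), k1_steps=50,
                       k2_range=(1 / 4, 1 / 1.1), k2_steps=20):
    """Return (K1, K2) at a given iteration, using linear ramps.

    K1 goes from 5 to 30 in 50 iterations and K2 from 1/4 to 1/1.1 in
    20 iterations; both are held constant afterwards (assumption).
    """
    def ramp(start, end, steps):
        fraction = min(iteration, steps) / steps
        return start + fraction * (end - start)

    return ramp(*k1_range, k1_steps), ramp(*k2_range, k2_steps)
```

At iteration 0 the helper returns the most conservative setting $(5, 1/4)$, and from iteration 50 onwards the least conservative one $(30, 1/1.1)$.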

To extend our analysis to realistic scenarios (on-the-fly tracking on long sequences),
we also report the incremental IHT tracking results for the 15 minutes long APIDIS dataset. Figure 8
compares the performance obtained when different sets of appearance features are exploited.
There are altogether 7460 ground-truth positions, i.e., GT=7460. As we can see, the switches
and re-initializations are reduced substantially when appearance features are exploited. However,
the false positives increase slightly.
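
As a sanity check on how the error components relate to MOTA, the short Python sketch below assumes the simple normalized form MOTA = 1 - (FP + MS + RE + SW) / GT, consistent with the description of the metric in Section 6.1. The formal definition is the one given in the CLEAR MOT reference; the function name is hypothetical.

```python
def mota(fp, ms, re, sw, gt):
    """MOTA in percent, assuming all four error types are normalized by GT."""
    return 100.0 * (1.0 - (fp + ms + re + sw) / gt)
```

With the counts reported below for the reference working point of incremental IHT (FP=20, MS=387, RE=113, SW=64) and GT=7460, this gives approximately 92.2%, which matches the reported MOTA.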

To study the computational advantages of our multi-scale approach, we estimate the time
taken by IHT with fixed (specifically, $|\delta| \in \{10, 50, 500\}$) and adaptive (i.e., $|\delta| = \kappa \cdot |v_{key}|$) observation
window sizes. The results are shown in Figure 8. We observe that the multi-scale nature of
the algorithm not only reduces the computational time but also improves the tracking accuracy.

We complete our results by presenting the performance of the incremental IHT algorithm (in
terms of MOTA components) with respect to some key parameters:

Figure 8: Components of MOTA for various feature combinations and computational advantages of multi-scale
nature of IHT on a 15 minutes long video. (Left.) The MOTA scores for all 4 cases are as follows: (i) No appearance:
89.1%, (ii) Digit only: 90.5%, (iii) Color only: 91.2%, and (iv) Both features: 92.2%. (Right.) Time taken by IHT, as a function of the video length (minutes), for
multi-scale as well as fixed-window approaches, together with the real-time limit. The legend reports MOTA=73.41% for $|\delta| = 10$, MOTA=86.06% for $|\delta| = 50$,
MOTA=87.76% for $|\delta| = 500$, and MOTA=92.2% for the multi-scale approach. Best viewed in color.


Parameter              Description
$\tau_{max}$           Connection horizon for new detections (Section 4.1)
$\gamma$               Missed detection coefficient (Section 4.1)
$\kappa$               Window proportionality constant (Section 4.2.1)
$(C_{min}, C_{max})$   Lower and upper thresholds to compute the reliability (Equation (8))
$(K_1, K_2)$           Factors to validate the shortest path (Section 4.2.2)


The reference working point is defined by ($\tau_{max} = 120$, $\gamma = 3$, $\kappa = 5$, $C_{min} = 20$, $C_{max} = 100$,
$K_1 = 5$, $K_2 = 1/3$), for which the results are (FP=20, MS=387, RE=113, SW=64, MOTA=92.2%)
for incremental IHT on the APIDIS dataset. In Table 2, only one parameter is changed at a time and
all other parameters are fixed at their reference values.

      $\tau_{max}$            $\gamma$                      $(C_{min}, C_{max})$
      30    60    120   240   1     2     3     4     5     (20,70)  (20,100)  (20,50)  (5,50)  (20,30)
FP    7     24    20    22    34    20    20    20    11    19       20        19       22      19
MS    426   410   387   380   429   382   387   382   399   400      387       402      390     402
RE    162   137   113   111   125   116   113   118   118   127      113       116      133     117
SW    74    64    64    76    72    83    64    70    71    65       64        70       67      75

      $\kappa$                $K_1$                   $K_2$                        Relaxed
      1     3     5     7     2     5     15    30    1/1.5  1/2    1/3    1/5
FP    14    19    20    33    14    18    24    30    70     32     18     17     20
MS    417   408   387   380   417   401   372   350   363    377    401    445    387
RE    120   126   113   110   133   120   103   97    101    107    120    149    113
SW    62    64    64    73    41    60    69    78    97     81     60     58     64

Table 2: Effect of $\tau_{max}$, $\gamma$, $(C_{min}, C_{max})$, $\kappa$, $K_1$ and $K_2$ on the 15 minutes video of the APIDIS dataset. For comparison, we also
present the results for the case in which $K_1$ and $K_2$ are progressively relaxed.


From Table 2, we observe that choosing a small $C_{min}$ and a large $C_{max}$ results in a more conservative
situation, leading to reduced switching errors but increased missed detections. A large
connection window $\tau_{max}$ is typically more robust to missed detections but is prone to switching
errors. The missed detection penalty $\gamma$ directly affects the missed detections. A small $\gamma$
creates ‘short-cuts’ in the shortest path. On the other hand, a large $\gamma$ favors only temporally
local detections. Both situations result in decreased performance. The observation window
factor $\kappa$ controls the range in which the key-node appearance hypothesis holds. A small
(respectively, large) $\kappa$ investigates a small (respectively, large) temporal neighborhood around the
key-node. Consequently, a small $\kappa$ results in decreased switching errors at the expense of increased
misses and re-initializations. Both $K_1$ and $K_2$ affect the performance. Low (respectively, high)
values of $K_1$ and $K_2$ result in fewer (respectively, more) false positives and switching errors but
more (respectively, fewer) misses and re-initializations, allowing us to trade off the errors. We
propose two alternatives to choose these parameters depending on the problem at hand. First,
if the objective is to have conservative tracking in which the resulting tracklets are reliable, it is
suggested to choose low values of $K_1$ and $K_2$. This option is suitable if one intends to process
these tracklets in a next step so as to stitch them into long trajectories. Second, we propose to
start with small values of $K_1$ and $K_2$ and then progressively relax them as the iterations proceed. This
option is suitable when long (but potentially erroneous) trajectories are preferred.

7. Conclusion and future perspectives 

This paper proposed a novel framework to associate detections while exploiting unreliable 
and/or sporadic appearance features. It proceeds iteratively, starting with a graph in which each 
node corresponds to a detection. Each iteration then investigates how to connect a node, named 
key-node, to its neighbors, under the assumption that the appearance of this key-node is representative
of the corresponding target appearance. Unambiguous associations are merged into
bigger nodes, thereby creating nodes with more reliable appearance cues. This aggregation also 
reduces the size of the graphs, and thus the complexity, handled by successive iterations of the 
algorithm. Defining the size of the neighborhood to be proportional to the size of the key-node 
naturally ends up in aggregating the data at larger time scales once more appearance cues have 
been accumulated along the key-node. Progressively relaxing the ambiguity criterion results in a 
greedy process, that primarily aggregates the less ambiguous paths in the graph. 